Search CORE

22 research outputs found

GABAC : An arithmetic coding solution for genomic data

Author: Bliss Brian
Fostier Jan
Hernaez Mikel
Mainzer Liudmila S.
Müntefering Fabian
Ochoa Idoia
Ostermann Jörn
Paridaens Tom
Voges Jan
Yang Mingyu
Publication venue: Oxford : Oxford Univ. Press
Publication date: 01/01/2020

Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. Results: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. © 2019 The Author(s). Published by Oxford University Press

Crossref

Ghent University Academic Bibliography

Institutionelles Repositorium der Leibniz Universität Hannover

Design considerations for workflow management systems use in production genomics research and the clinic

Author: Ahmed Azza E.
Allen Joshua M.
Bhat Tajesvi
Burra Prakruthi
Fadlelmola Faisal M.
Fliege Christina E.
Hart Steven N.
Heldenbrand Jacob R.
Hudson Matthew E.
Istanto Dave Deandre
Kalmbach Michael T.
Kapraun Gregory D.
Kendig Katherine I.
Kendzior Matthew Charles
Klee Eric W.
Mainzer Liudmila S.
Mattson Nate
Ross Christian A.
Sharif Sami M.
Venkatakrishnan Ramshankar
Publication venue: Springer Science and Business Media LLC
Publication date: 01/11/2021

Abstract: The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL, and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, both of which were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, and ease of development, along with adoption and usage in research labs and healthcare settings. This article asks which WfMS should be chosen for a given bioinformatics application, regardless of analysis type. The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Directory of Open Access Journals

Dissertations of the University of Groningen

Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)
Publication date: 01/01/2016

An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at http://github.com/zstephens/neat-genreads.

Directory of Open Access Journals

PubMed Central

FigShare

Overview of mutation and sequencing model generation.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

FigShare

SNP substitution frequency matrices for Leukemia model.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

FigShare

SNP substitution frequency matrices for breast cancer model.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

The label for each 4 × 4 matrix specifies the nucleotide immediately preceding and following the SNP position. For example, row 3 column 2 of the “A_A” matrix specifies the frequency of AGA mutating into ACA, as observed in the breast cancer SSM dataset.
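The indexing convention in this caption can be illustrated with a minimal sketch. It assumes that rows and columns are ordered A, C, G, T (which is consistent with the caption's example but not stated in it), that rows give the reference base and columns the mutated base, and that indices are 1-based as in the caption. The function name `describe_cell` is hypothetical and is not part of NEAT.

```python
# Assumed ordering of rows and columns; the caption does not state it explicitly.
NUCLEOTIDES = "ACGT"


def describe_cell(label: str, row: int, col: int) -> tuple[str, str]:
    """Map a 1-based (row, col) cell of a context matrix to trinucleotides.

    label is the matrix name, e.g. "A_A": the base preceding the SNP,
    an underscore, and the base following the SNP.
    Returns (reference trinucleotide, mutated trinucleotide).
    """
    before, after = label.split("_")
    ref_base = NUCLEOTIDES[row - 1]  # row selects the original middle base
    alt_base = NUCLEOTIDES[col - 1]  # column selects the substituted base
    return before + ref_base + after, before + alt_base + after


# Row 3, column 2 of the "A_A" matrix: AGA mutating into ACA.
print(describe_cell("A_A", 3, 2))
```

Under these assumptions, each of the 16 context labels (A_A through T_T) names one 4 × 4 matrix, so the full model covers every trinucleotide substitution.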

FigShare

Comparison of read simulator features.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

FigShare

Overview of NEAT Read Simulator.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

FigShare

Empirical insert size distribution from two example BAM files.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

(Left) ICGC donor DO35138: http://dcc.icgc.org/donors/DO35138, (Right) ICGC donor DO221544: http://dcc.icgc.org/donors/DO221544, both from project PACA-CA.

FigShare

SNP substitution frequency matrices for Melanoma model.

Author: Liudmila S. Mainzer (3366242)
Matthew E. Hudson (123782)
Matthew R. Weber (3366239)
Morgan Taschuk (226628)
Ravishankar K. Iyer (2620330)
Zachary D. Stephens (766580)

Note the strong preference for G → A and C → T transitions, as observed in existing work [17].

FigShare